---
title: Data FAQ
dataset_name: N/A
description: Provides a list, with brief answers, of frequently asked data preparation and management questions. Answers link to more complete documentation.

---

# Data FAQ {: #data-faq }

??? faq "What is the AI Catalog?"
    The [AI Catalog](ai-catalog/index) is a DataRobot tool for importing, registering, and sharing data and other assets. The catalog supports browsing and searching registered assets, including definitions and relationships with other assets.

??? faq "What is Data Prep?"
    [Data Prep](companion-tools/index) is a DataRobot tool for cleaning and transforming data to be used in machine learning. Data Prep lets you prepare data from [multiple sources](companion-tools/index). You can save and share your data, as well as the steps used to prepare it.

??? faq "What file types can DataRobot ingest?"
    DataRobot can ingest text, Excel, SAS, and various compressed or archive files. [Supported file formats](file-types#data-formats) are listed at the bottom of the project (Start) page. You can [import files directly into DataRobot](import-to-dr) or you can [import them into the AI Catalog](catalog).

??? faq "What data sources can DataRobot connect to?"
    DataRobot can ingest from [JDBC-enabled data sources](data-conn), as well as S3, Azure Blob, Google Cloud Storage, and URLs, among others.

??? faq "What is a histogram used for?"
    [Histograms](histogram#histogram-chart) bucket numeric feature values into equal-sized ranges to show a rough distribution of the variable (feature). Access a feature's histogram by expanding the feature in the **Data** tab.
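
    As a rough illustration of equal-sized bucketing (a minimal sketch, not DataRobot's implementation; the function name `equal_width_histogram` and the sample values are hypothetical), the following counts numeric values into a fixed number of equal-width ranges:

    ```python
    def equal_width_histogram(values, num_bins):
        """Count values in num_bins equal-width ranges spanning [min, max]."""
        lo, hi = min(values), max(values)
        if hi == lo:
            # All values are identical, so a single bin holds everything.
            return [(lo, hi, len(values))]
        width = (hi - lo) / num_bins
        counts = [0] * num_bins
        for v in values:
            # The maximum value sits on the upper edge; clamp it into the last bin.
            index = min(int((v - lo) / width), num_bins - 1)
            counts[index] += 1
        return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


    # Hypothetical sample data for illustration only.
    ages = [21, 22, 25, 31, 38, 40, 41, 59, 60, 61]
    for start, end, count in equal_width_histogram(ages, 4):
        print(f"{start:5.1f} to {end:5.1f}: {'#' * count}")
    ```

    Each printed row is one bucket; the bar length shows how many values fall in that range.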

??? faq "What do yellow triangles mean on the **Data** tab?"
    Upon uploading data, DataRobot automatically detects and identifies common data quality issues. The [Data Quality Assessment](data-quality) report denotes these data quality issues with yellow triangle warnings. Hover over the triangles to see the specific quality issues, such as excess zeros or outliers.

??? faq "How can I share a dataset?"
    Use the AI Catalog to [share a dataset](sharing) with users, groups, and organizations. You can select a role for each recipient of the asset: owner (can view, edit, and administer), editor (can view and edit), or consumer (can view).

??? faq "How does DataRobot reduce features?"
    DataRobot automatically implements feature reduction at multiple stages of the modeling life cycle:

    1. During [EDA1](eda-explained#eda1): After you upload your data, DataRobot creates an informative feature list by excluding non-informative features, such as those with too many unique values.
    2. After [EDA2](eda-explained#eda2): After you click **Start**, DataRobot removes features with target leakage (i.e., features with a high correlation to the target) and features with an ACE score less than 0.0005 (i.e., features with a marginal correlation to the target).
    3. During model training and analysis: DataRobot removes all redundant features and retrains the model, keeping features with a cumulative feature importance score over 0.95.
    4. As a step in the model's blueprint: Some algorithms, including LASSO and ENET, offer intrinsic feature reduction by shrinking coefficients to 0.
    5. [Automated Feature Discovery](fd-gen): Feature Discovery projects explore and generate features based on the secondary dataset(s), and then perform [supervised feature reduction](fd-overview#feature-reduction) to only keep features with an estimated cumulative feature importance score over 0.98.
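
    The cumulative-score thresholds in the steps above can be pictured with a small sketch. This is an illustration under assumptions, not DataRobot's algorithm: it assumes features are ranked by score and kept until their combined share of the total reaches the threshold. The function name `keep_by_cumulative_importance` and the scores are hypothetical.

    ```python
    def keep_by_cumulative_importance(importances, threshold):
        """Keep top-ranked features until their share of total importance reaches threshold."""
        total = sum(importances.values())
        if total <= 0:
            # Nothing to rank when no feature carries any importance.
            return []
        ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
        kept, running = [], 0.0
        for name, score in ranked:
            kept.append(name)
            running += score
            if running / total >= threshold:
                break
        return kept


    # Hypothetical importance scores for illustration only.
    scores = {"income": 50, "age": 30, "tenure": 12, "zip_code": 6, "row_id": 2}
    print(keep_by_cumulative_importance(scores, 0.95))
    ```

    With these sample scores, the first three features account for 92% of the total, so a fourth is needed to pass a 0.95 threshold and the last feature is dropped.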

    For more information, see the [documentation for data transformations](transform-data/index).

??? faq "What are informative features?"
    Informative features are those that are potentially valuable for modeling. DataRobot generates an [informative features list](feature-lists#automatically-created-feature-lists) where features that will not be useful are removed. Some examples include reference IDs, features that contain empty values, and features that are derived from the target. DataRobot also creates features, such as date type features, and if valuable, includes them in the informative features list.

??? faq "What is a snapshot?"
    You can create a *[snapshot](catalog#create-a-snapshot)* of your data in the AI Catalog, in which case DataRobot stores a copy of your data in the catalog. You can then [schedule the snapshot](snapshot) to be refreshed periodically. If you don't create a snapshot, the data is *dynamic*&mdash;DataRobot samples for profile statistics but does not keep a copy of the data. Instead, the catalog stores a pointer to the data and pulls it upon request, for example, when you create a project.

??? faq "What are the green "importance" bars on the **Data** tab?"
    The [importance bars](model-ref#importance-score) show the degree to which a feature is correlated with the target. These bars are based on "Alternating Conditional Expectations" (ACE) scores, which detect non-linear relationships with the target but cannot detect interaction effects between features. Importance measures the information content of the feature; this calculation is done independently for each feature in the project.

??? faq "How large can my datasets be?"
    [File size requirements](file-types#ensure-acceptable-file-import-size) vary depending on deployment type (Cloud versus on-premise) and whether you are using [AutoML](file-types#automl-file-import-sizes), [time series](file-types#time-series-file-import-sizes), and/or [Feature Discovery](file-types#feature-discovery-file-import-sizes).

??? faq "How do I remove rows and columns from my dataset?"
    You can use the [Data Prep](companion-tools/index) tool to remove rows or columns from your dataset. If you have the same data in multiple rows, you can [deduplicate](companion-tools/index). You can use a [Filtergram](companion-tools/index) to select rows for removal and you can use the [Columns tool](companion-tools/index) to remove columns.
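
    If you clean a file yourself before importing it, the same three operations (deduplicating, filtering rows, and dropping columns) can be sketched in plain Python. This is not Data Prep code; the helper `clean_rows`, the column names, and the sample data are hypothetical.

    ```python
    import csv
    import io


    def clean_rows(rows, drop_columns=(), keep_row=lambda row: True):
        """Drop exact duplicate rows, filter rows by a predicate, and remove named columns."""
        seen = set()
        cleaned = []
        for row in rows:
            # Two rows are duplicates when every column holds the same value.
            key = tuple(sorted(row.items()))
            if key in seen or not keep_row(row):
                continue
            seen.add(key)
            cleaned.append({k: v for k, v in row.items() if k not in drop_columns})
        return cleaned


    # Hypothetical sample data for illustration only.
    raw = io.StringIO("id,age,notes\n1,34,ok\n1,34,ok\n2,-5,bad\n3,41,ok\n")
    rows = list(csv.DictReader(raw))
    result = clean_rows(rows, drop_columns={"notes"}, keep_row=lambda r: int(r["age"]) >= 0)
    print(result)
    ```

    Duplicates are detected on the full row before any columns are dropped, so removing a column never causes two distinct rows to collapse into one.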
